In this chapter, we will show how we use the skrub TableReport to explore tabular data. We will use the Adult Census dataset as our example table, and perform some exploratory analysis to learn about the characteristics of the data.
First, let’s import the necessary libraries and load the dataset.
import pandas as pdfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.datasets import fetch_openml# Load the Adult Census dataset from OpenMLadult = fetch_openml(name='adult', version=2, as_frame=True)data = adult.datatarget = adult.target
Now that we have a dataframe we can work with, we would like to find out:
The size of the dataset.
The data types and names of the columns.
The distribution of values in the columns.
Whether null values are present, in what measure and where.
Discrete/categorical features, and their cardinality.
Columns strongly correlated with each other.
3.2 Exploring data with Pandas tools
For the sake of the example, let’s first explore the data using Pandas only.
We can get an idea of the content of the table by printing the first few lines, which gives an idea of the datatypes and the columns we are dealing with.
data.head(5)
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
0
25
Private
226802
11th
7
Never-married
Machine-op-inspct
Own-child
Black
Male
0
0
40
United-States
1
38
Private
89814
HS-grad
9
Married-civ-spouse
Farming-fishing
Husband
White
Male
0
0
50
United-States
2
28
Local-gov
336951
Assoc-acdm
12
Married-civ-spouse
Protective-serv
Husband
White
Male
0
0
40
United-States
3
44
Private
160323
Some-college
10
Married-civ-spouse
Machine-op-inspct
Husband
Black
Male
7688
0
40
United-States
4
18
NaN
103497
Some-college
10
Never-married
NaN
Own-child
White
Female
0
0
30
United-States
If we want to have a simpler view of the datatypes in the dataframe, we must use data.info():
With .info() we can find out the shape of the dataframe (the number of rows and columns), the datatype and the number of non-null values for each column.
We can also get a richer summary of the data with the .describe() method:
data.describe(include="all")
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
count
48842.000000
46043
4.884200e+04
48842
48842.000000
48842
46033
48842
48842
48842
48842.000000
48842.000000
48842.000000
47985
unique
NaN
8
NaN
16
NaN
7
14
6
5
2
NaN
NaN
NaN
41
top
NaN
Private
NaN
HS-grad
NaN
Married-civ-spouse
Prof-specialty
Husband
White
Male
NaN
NaN
NaN
United-States
freq
NaN
33906
NaN
15784
NaN
22379
6172
19716
41762
32650
NaN
NaN
NaN
43832
mean
38.643585
NaN
1.896641e+05
NaN
10.078089
NaN
NaN
NaN
NaN
NaN
1079.067626
87.502314
40.422382
NaN
std
13.710510
NaN
1.056040e+05
NaN
2.570973
NaN
NaN
NaN
NaN
NaN
7452.019058
403.004552
12.391444
NaN
min
17.000000
NaN
1.228500e+04
NaN
1.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
1.000000
NaN
25%
28.000000
NaN
1.175505e+05
NaN
9.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
40.000000
NaN
50%
37.000000
NaN
1.781445e+05
NaN
10.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
40.000000
NaN
75%
48.000000
NaN
2.376420e+05
NaN
12.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
45.000000
NaN
max
90.000000
NaN
1.490400e+06
NaN
16.000000
NaN
NaN
NaN
NaN
NaN
99999.000000
4356.000000
99.000000
NaN
This gives us useful information about all the features in the dataset. Among others, we can find the number of unique values in each column, various statistics for the numerical columns and the number of null values.
3.3 Exploring data with the TableReport
Now, let’s create a TableReport to explore the dataset.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
CategoricalDtype
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
CategoricalDtype
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
CategoricalDtype
False
0 (0.0%)
7 (< 0.1%)
6
occupation
CategoricalDtype
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
CategoricalDtype
False
0 (0.0%)
6 (< 0.1%)
8
race
CategoricalDtype
False
0 (0.0%)
5 (< 0.1%)
9
sex
CategoricalDtype
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
CategoricalDtype
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
3.3.1 Default view of the TableReport
The TableReport gives us a comprehensive overview of the dataset. The default view shows all the columns in the dataset, and allows to select and copy the content of the cells shown in the preview.
The TableReport is intended to show a preview of the data, so it does not contain all the rows in the dataset, rather it shows only the first and last few rows by default.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
CategoricalDtype
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
CategoricalDtype
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
CategoricalDtype
False
0 (0.0%)
7 (< 0.1%)
6
occupation
CategoricalDtype
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
CategoricalDtype
False
0 (0.0%)
6 (< 0.1%)
8
race
CategoricalDtype
False
0 (0.0%)
5 (< 0.1%)
9
sex
CategoricalDtype
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
CategoricalDtype
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
The “Stats” tab provides a variety of descriptive statistics for each column in the dataset. This includes:
The column name
The detected data type of the column
Whether the column is sorted or not
The number of null values in the column, as well as the percentage
The number of unique values in the column
For numerical columns, additional statistics are provided:
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
CategoricalDtype
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
CategoricalDtype
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
CategoricalDtype
False
0 (0.0%)
7 (< 0.1%)
6
occupation
CategoricalDtype
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
CategoricalDtype
False
0 (0.0%)
6 (< 0.1%)
8
race
CategoricalDtype
False
0 (0.0%)
5 (< 0.1%)
9
sex
CategoricalDtype
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
CategoricalDtype
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
The “Distributions” tab provides visualizations of the distributions of values in each column. This includes histograms for numerical columns and bar plots for categorical columns.
The “Distributions” tab helps with detecting potential issues in the data, such as:
Skewed distributions
Outliers
Unexpected value frequencies
For example, in this dataset we can see that some columns are heavily skewed, such as “workclass”, “race”, and “native-country”: this is important information to keep track of, because these columns may require special handling during data preprocessing or modeling.
Additionally, the “Distributions” tab allows so select columns manually, so that they can be added to a script and selected for further analysis or modeling.
Caution
The TableReport detects outliers using a simple interquartile test, marking as outliers all values that are beyond the IQR. This is a simple heuristic, and should not be treated as perfect. If your problem requires reliable outlier detection, you should not rely exclusively on what the TableReport shows.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
CategoricalDtype
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
CategoricalDtype
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
CategoricalDtype
False
0 (0.0%)
7 (< 0.1%)
6
occupation
CategoricalDtype
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
CategoricalDtype
False
0 (0.0%)
6 (< 0.1%)
8
race
CategoricalDtype
False
0 (0.0%)
5 (< 0.1%)
9
sex
CategoricalDtype
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
CategoricalDtype
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
The “Associations” tab provides insights into the relationships between different columns in the dataset. It shows Pearson’s correlation coefficients for numerical columns, as well as Cramér’s V for all columns.
While this is a somewhat rough measure of association, it can help identify potential relationships worth exploring further during the analysis, and highlights highly correlated columns: depending on the modeling technique used, these may need to be handled specially to avoid issues with multicollinearity.
In this example, we can see that “education-num” and “education” have perfect correlation, which means that one of the two columns can be dropped without losing information.
3.4 Exploring the target variable
Let’s take a closer look at the target variable, which indicates whether an individual’s income exceeds $50K per year. We can create a separate TableReport for the target variable to explore its distribution:
TableReport(target)
Processing column 1 / 1
Click a table cell for more info about its column.
class
0
<=50K
1
<=50K
2
>50K
3
>50K
4
<=50K
48,837
<=50K
48,838
>50K
48,839
<=50K
48,840
<=50K
48,841
>50K
class
CategoricalDtype
Null values
0 (0.0%)
Unique values
2 (< 0.1%)
Most frequent values
<=50K
>50K
List:
['<=50K', '>50K']
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
class
CategoricalDtype
False
0 (0.0%)
2 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
class
CategoricalDtype
Null values
0 (0.0%)
Unique values
2 (< 0.1%)
Most frequent values
<=50K
>50K
List:
['<=50K', '>50K']
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
No strong associations between any pair of columns were identified by a quick screening of a subsample of the dataframe.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
3.5 Configuring and saving the TableReport
The TableReport can be saved on disk as an HTML.
TableReport(data).write_html("report.html")
Then, the report can be opened using any internet browser, with no need to run a Jupyter notebok or a python interactive console.
It is possible to configure various parameters using the skrub global config. For example, it is possible to replace the default Pandas or Polars dataframe display with the TableReport as follows:
from skrub import set_configset_config(use_table_report=True)data
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
CategoricalDtype
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
CategoricalDtype
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
CategoricalDtype
False
0 (0.0%)
7 (< 0.1%)
6
occupation
CategoricalDtype
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
CategoricalDtype
False
0 (0.0%)
6 (< 0.1%)
8
race
CategoricalDtype
False
0 (0.0%)
5 (< 0.1%)
9
sex
CategoricalDtype
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
CategoricalDtype
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
3.6 Working with big tables
Plotting and measuring the column correlations are expensive operations and may take a long time, so when the dataframe under study is large it may be more convenient to skip them.
The max_plot_columns and max_association_columns parameters allow to set a threshold on the number of columns: the TableReport will skip the respective task if the number of colums in the dataframe is larger than the threshold:
Click a table cell for more info about its column.
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
0
25
Private
226,802
11th
7
Never-married
Machine-op-inspct
Own-child
Black
Male
0
0
40
United-States
1
38
Private
89,814
HS-grad
9
Married-civ-spouse
Farming-fishing
Husband
White
Male
0
0
50
United-States
2
28
Local-gov
336,951
Assoc-acdm
12
Married-civ-spouse
Protective-serv
Husband
White
Male
0
0
40
United-States
3
44
Private
160,323
Some-college
10
Married-civ-spouse
Machine-op-inspct
Husband
Black
Male
7,688
0
40
United-States
4
18
103,497
Some-college
10
Never-married
Own-child
White
Female
0
0
30
United-States
48,837
27
Private
257,302
Assoc-acdm
12
Married-civ-spouse
Tech-support
Wife
White
Female
0
0
38
United-States
48,838
40
Private
154,374
HS-grad
9
Married-civ-spouse
Machine-op-inspct
Husband
White
Male
0
0
40
United-States
48,839
58
Private
151,910
HS-grad
9
Widowed
Adm-clerical
Unmarried
White
Female
0
0
40
United-States
48,840
22
Private
201,490
HS-grad
9
Never-married
Adm-clerical
Own-child
White
Male
0
0
20
United-States
48,841
52
Self-emp-inc
287,927
HS-grad
9
Married-civ-spouse
Exec-managerial
Wife
White
Female
15,024
0
40
United-States
age
Int64DType
Null values
0 (0.0%)
Unique values
74 (0.2%)
This column has a high cardinality (> 40).
Mean ± Std
38.6 ±
13.7
Median ± IQR
37 ±
20
Min | Max
17 |
90
workclass
CategoricalDtype
Null values
2,799 (5.7%)
Unique values
8 (< 0.1%)
fnlwgt
Int64DType
Null values
0 (0.0%)
Unique values
28,523 (58.4%)
This column has a high cardinality (> 40).
Mean ± Std
1.90e+05 ±
1.06e+05
Median ± IQR
178,142 ±
120,097
Min | Max
12,285 |
1,490,400
education
CategoricalDtype
Null values
0 (0.0%)
Unique values
16 (< 0.1%)
education-num
Int64DType
Null values
0 (0.0%)
Unique values
16 (< 0.1%)
Mean ± Std
10.1 ±
2.57
Median ± IQR
10 ±
3
Min | Max
1 |
16
marital-status
CategoricalDtype
Null values
0 (0.0%)
Unique values
7 (< 0.1%)
occupation
CategoricalDtype
Null values
2,809 (5.8%)
Unique values
14 (< 0.1%)
relationship
CategoricalDtype
Null values
0 (0.0%)
Unique values
6 (< 0.1%)
race
CategoricalDtype
Null values
0 (0.0%)
Unique values
5 (< 0.1%)
sex
CategoricalDtype
Null values
0 (0.0%)
Unique values
2 (< 0.1%)
capital-gain
Int64DType
Null values
0 (0.0%)
Unique values
123 (0.3%)
This column has a high cardinality (> 40).
Mean ± Std
1.08e+03 ±
7.45e+03
Median ± IQR
0 ±
0
Min | Max
0 |
99,999
capital-loss
Int64DType
Null values
0 (0.0%)
Unique values
99 (0.2%)
This column has a high cardinality (> 40).
Mean ± Std
87.5 ±
403.
Median ± IQR
0 ±
0
Min | Max
0 |
4,356
hours-per-week
Int64DType
Null values
0 (0.0%)
Unique values
96 (0.2%)
This column has a high cardinality (> 40).
Mean ± Std
40.4 ±
12.4
Median ± IQR
40 ±
5
Min | Max
1 |
99
native-country
CategoricalDtype
Null values
857 (1.8%)
Unique values
41 (< 0.1%)
This column has a high cardinality (> 40).
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
CategoricalDtype
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
CategoricalDtype
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
CategoricalDtype
False
0 (0.0%)
7 (< 0.1%)
6
occupation
CategoricalDtype
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
CategoricalDtype
False
0 (0.0%)
6 (< 0.1%)
8
race
CategoricalDtype
False
0 (0.0%)
5 (< 0.1%)
9
sex
CategoricalDtype
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
CategoricalDtype
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
Plotting was skipped as the dataframe exceeded the
max_plot_columns
limit set for the TableReport during report creation.
You can adjust this behavior in several ways:
To fix it for a single report, change the parameter passed when creating TableReport (default: 30):
report = TableReport(max_plot_columns=50)
To change the default number of columns to be plotted during the current Python session, use skrub.set_config:
from skrub import set_config
set_config(max_plot_columns=50)
To make the change permanent, use an environment variable:
export SKB_MAX_PLOT_COLUMNS=50
Note that TableReport accepts two parameters:
The max_plot_columns limits the number of columns that are plotted.
The max_association_columns limits the number of columns for which associations should be computed.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Computing pairwise associations was skipped because the dataframe has many columns.
See TableReport's max_association_columns parameter.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
When the number of columns is too large, an information message is shown in the respective tab instead of the plots or correlations.
4 Exercise: exploring a new table
For this exercise, we will use the employee_salaries dataframe to answer some questions.
Run the following code to import the dataframe:
from skrub.datasets import fetch_employee_salariesemployee_salaries = fetch_employee_salaries()data = employee_salaries.X
Now use the skrub TableReport and answer the following questions:
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType
Null values
0 (0.0%)
Unique values
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
List:
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
gender
ObjectDType
False
17 (0.2%)
2 (< 0.1%)
1
department
ObjectDType
False
0 (0.0%)
37 (0.4%)
2
department_name
ObjectDType
False
0 (0.0%)
37 (0.4%)
3
division
ObjectDType
False
0 (0.0%)
694 (7.5%)
4
assignment_category
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
5
employee_position_title
ObjectDType
False
0 (0.0%)
443 (4.8%)
6
date_first_hired
ObjectDType
False
0 (0.0%)
2264 (24.5%)
7
year_first_hired
Int64DType
False
0 (0.0%)
51 (0.6%)
2.00e+03
9.33
1,965
2,005
2,016
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType
Null values
0 (0.0%)
Unique values
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
List:
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
4.1 Questions
What’s the size of the dataframe? (columns and rows)
How many columns have object/numerical/datetime
Are there columns with a large number of missing values?
Are there columns that have a high cardinality (>40 unique values)?
Were datetime columns parsed correctly?
Which columns have outliers?
Which columns have an imbalanced distribution?
Which columns are strongly correlated with each other?
# PLACEHOLDER#########
4.2 Answers
What’s the size of the dataframe? (columns and rows)
9228 rows × 8 columns
How many columns have object/numerical/datetime
No datetime columns, one integer column (year_first_hired), all other columns are objects.
Are there columns with a large number of missing values?
No, only the gender column contains a small fraction (0.2%) of missing values.
Are there columns that have a high cardinality?
Yes, division, employee_position_title, date_first_hired have a cardinality larger than 40.
Were datetime columns parsed correctly?
No, the date_first_hired column has dtype Object.
Which columns have outliers?
No columns seem to include outliers.
Which columns have an imbalanced distribution?
assignment_category has an unbalanced distribution.
Which columns are strongly correlated with each other?
department and department_name have a Cramer’s V of 1, so they are very strongly correlated.